vision and language
From Visual Question Answering to multimodal learning: an interview with Aishwarya Agrawal
You were awarded an Honourable Mention for the 2019 AAAI / ACM SIGAI Doctoral Dissertation Award. What was the topic of your dissertation research, and what were the main contributions or findings?
My PhD dissertation was on the topic of Visual Question Answering (VQA). We proposed the task of open-ended and free-form VQA - a new way to benchmark computer vision models by asking them questions about images - and we curated a large-scale dataset for researchers to train and test their models on this task.
- North America > Canada > Quebec > Montreal (0.04)
- Asia > India > Gujarat > Gandhinagar (0.04)
- Personal > Interview (0.65)
- Research Report > New Finding (0.47)
- Personal > Honors > Award (0.35)
Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis
Deep network models are often purely inductive during both training and inference on unseen data. When these models are used for prediction, they may fail to capture important semantic information and implicit dependencies within datasets. Recent advancements have shown that combining multiple modalities in large-scale vision and language settings can improve understanding and generalization performance. However, as the model size increases, fine-tuning and deployment become computationally expensive, even for a small number of downstream tasks. Moreover, it is still unclear how domain or prior modal knowledge can be specified in a backpropagation-friendly manner, especially in large-scale and noisy settings.
Reviews: Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations
The paper proposes a new method called Mutual Iterative Attention (MIA) for improving the representations used by common visual-question-answering and image-captioning models. MIA works by repeated execution of 'mutual attention', a computation similar to the self-attention operation in the Transformer model, but where the lookup ('query') representation is conditioned on information from the other modality. Importantly, the two modalities involved in the MIA operation are not vision and language; they are vision and 'textual concepts' (which the paper also calls 'textual words' and 'visual words' at various points). These are actual words referring to objects that can be found in the image. The model that predicts textual concepts (the 'visual words' extractor) is trained on the MS-COCO dataset in a separate optimization from the captioning model. Applying MIA to a range of models before attempting VQA or captioning tasks improves the scores, in some cases above the state of the art. It is a strength of this paper that the authors apply their method to a wide range of existing models and observe consistent improvements.
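To make the mechanism concrete, here is a minimal sketch of the 'mutual attention' computation as the review describes it: standard scaled dot-product attention in which each modality's queries attend over the other modality, applied iteratively. The function names, residual updates and tensor sizes are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch of one round of "mutual attention": each modality's
# queries are conditioned on the other modality, as the review describes.
# Function and variable names are hypothetical, not taken from the paper.
import torch
import torch.nn.functional as F

def cross_attend(queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention where `queries` come from one modality
    and keys/values come from the other (`context`)."""
    d = queries.size(-1)
    scores = queries @ context.transpose(-2, -1) / d ** 0.5   # (..., Nq, Nc)
    weights = F.softmax(scores, dim=-1)
    return weights @ context                                   # (..., Nq, d)

def mutual_iteration(visual: torch.Tensor, textual: torch.Tensor):
    """One iteration: refine visual features using the textual concepts, then
    refine the textual-concept features using the refined visual features."""
    visual_refined = visual + cross_attend(visual, textual)    # residual update
    textual_refined = textual + cross_attend(textual, visual_refined)
    return visual_refined, textual_refined

# Toy shapes: 36 region features and 10 textual-concept embeddings, dim 512.
v = torch.randn(1, 36, 512)
t = torch.randn(1, 10, 512)
for _ in range(3):   # repeated execution, as in the review's description
    v, t = mutual_iteration(v, t)
```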
Towards Zero-Shot & Explainable Video Description by Reasoning over Graphs of Events in Space and Time
Masala, Mihai, Leordeanu, Marius
Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, action and object recognition, among many others. Interestingly enough, while these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language is largely still beyond our reach. Moreover, such models suffer from overfitting, such that once given a video from an unseen context or distribution the quality and accuracy of the description drops, as our evaluations prove. On the other hand, VLLMs have shown impressive results, being capable of generating long, rich descriptions of videos. Unfortunately, VLLMs still share some of the same weaknesses as previous methods: they are unexplainable and they still rely on sampling frames to process a video. Moreover, top-performing models such as GPT, Claude or Gemini are not open and are only accessible via a paid API. We argue that one of the main reasons why this interdisciplinary, cross-domain task is still far from being solved is that we still lack an explainable way to bridge this apparently insurmountable gap. Explainability could provide a more analytical and stage-wise way to make the transition from vision to language that is both trustworthy and … In this work, we propose a common ground between vision and language based on events in space and time, in an explainable and programmatic way, to connect learning-based vision and language state-of-the-art models and provide a solution to the long-standing problem of describing videos in natural language. We validate that our algorithmic approach is able to generate coherent, rich and relevant textual descriptions on videos collected from a variety of datasets, using both standard metrics (e.g. Bleu, ROUGE) and the modern LLM-as-a-Jury approach.
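The 'events in space and time' representation can be pictured with a small data structure. The sketch below is a generic illustration under assumed field names (action, actors, time interval, location, relation); it is not the representation used in the paper.

```python
# A minimal, generic sketch of a graph of events in space and time:
# nodes are events with actors, a time interval and a location; edges carry
# a relation such as "before" or "causes". Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Event:
    action: str
    actors: tuple[str, ...]
    t_start: float          # seconds into the video
    t_end: float
    location: str

@dataclass
class EventGraph:
    events: list[Event] = field(default_factory=list)
    edges: list[tuple[int, int, str]] = field(default_factory=list)  # (src, dst, relation)

    def add(self, event: Event) -> int:
        self.events.append(event)
        return len(self.events) - 1

    def relate(self, src: int, dst: int, relation: str) -> None:
        self.edges.append((src, dst, relation))

# Toy story: two events linked by a temporal relation.
g = EventGraph()
a = g.add(Event("pick up cup", ("person",), 1.0, 2.5, "kitchen"))
b = g.add(Event("drink", ("person",), 2.5, 4.0, "kitchen"))
g.relate(a, b, "before")
```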
Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond
Han, Soyeon Caren, Cao, Feiqi, Poon, Josiah, Navigli, Roberto
This tutorial explores recent advancements in multimodal pretrained and large models, capable of integrating and processing diverse data forms such as text, images, audio, and video. Participants will gain an understanding of the foundational concepts of multimodality, the evolution of multimodal research, and the key technical challenges addressed by these models. We will cover the latest multimodal datasets and pretrained models, including those beyond vision and language. Additionally, the tutorial will delve into the intricacies of multimodal large models and instruction tuning strategies to optimise performance for specific tasks. Hands-on laboratories will offer practical experience with state-of-the-art multimodal models, demonstrating real-world applications like visual storytelling and visual question answering. This tutorial aims to equip researchers, practitioners, and newcomers with the knowledge and skills to leverage multimodal AI. ACM Multimedia 2024 is the ideal venue for this tutorial, aligning perfectly with our goal of understanding multimodal pretrained and large language models, and their tuning mechanisms.
- Oceania > Australia > Victoria > Melbourne (0.06)
- Oceania > Australia > New South Wales > Sydney (0.05)
- North America > United States > New York > New York County > New York City (0.05)
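As a rough illustration of the kind of instruction-tuning data the tutorial refers to, here is one schematic vision-language instruction record; the keys and file path are hypothetical placeholders rather than any specific dataset's schema.

```python
# Schematic example of a single vision-language instruction-tuning record.
# Keys and the image path are illustrative placeholders, not a real schema.
record = {
    "image": "images/000123.jpg",   # path to the visual input
    "instruction": "Describe what the person in the photo is doing.",
    "response": "A person is riding a bicycle along a beach at sunset.",
}

# During tuning, the image is passed through a vision encoder while the
# instruction/response pair is tokenised; the model is trained to generate
# the response conditioned on both the image features and the instruction.
```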
Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond
Zhang, Xiang, Li, Senyu, Wu, Zijun, Shi, Ning
Recent advancements in multimodal techniques open exciting possibilities for models excelling in diverse tasks involving text, audio, and image processing. Models like GPT-4V, blending computer vision and language modeling, excel in complex text and image tasks. Numerous prior research endeavors have diligently examined the performance of these Vision Large Language Models (VLLMs) across tasks like object detection, image captioning and others. However, these analyses often focus on evaluating the performance of each modality in isolation, lacking insights into their cross-modal interactions. Specifically, questions concerning whether these vision-language models execute vision and language tasks consistently or independently have remained unanswered. In this study, we draw inspiration from recent investigations into multilingualism and conduct a comprehensive analysis of these models' cross-modal interactions. We introduce a systematic framework that quantifies the capability disparities between different modalities in the multimodal setting and provide a set of datasets designed for these evaluations. Our findings reveal that models like GPT-4V tend to perform consistently across modalities when the tasks are relatively simple. However, the trustworthiness of results derived from the vision modality diminishes as the tasks become more challenging. Expanding on our findings, we introduce "Vision Description Prompting," a method that effectively improves performance in challenging vision-related tasks.
- North America > Canada > Alberta (0.15)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Oceania > Australia (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.49)
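The abstract does not spell out how Vision Description Prompting works; one plausible reading is a two-stage prompt that first elicits a textual description of the image and then answers the question conditioned on that description. The sketch below illustrates that reading with a hypothetical `vlm_chat` callable standing in for any vision-language chat API; it is not the authors' implementation.

```python
# Hedged sketch of a two-stage "describe, then answer" prompting pipeline,
# one plausible reading of "Vision Description Prompting" from the abstract.
# `vlm_chat` is a hypothetical stand-in for any vision-language chat API.
from typing import Callable

def vision_description_prompting(
    vlm_chat: Callable[[str, str], str],   # (image_path, prompt) -> model reply
    image_path: str,
    question: str,
) -> str:
    # Stage 1: elicit a detailed textual description of the image.
    description = vlm_chat(image_path, "Describe this image in detail.")
    # Stage 2: answer the question, grounding the model in its own description.
    prompt = (
        f"Image description: {description}\n"
        f"Using the description above, answer the question: {question}"
    )
    return vlm_chat(image_path, prompt)

# Example with a dummy backend (no real API calls are made here).
def dummy_vlm(image_path: str, prompt: str) -> str:
    return f"[reply to: {prompt[:40]}...]"

print(vision_description_prompting(dummy_vlm, "photo.jpg", "How many chairs are visible?"))
```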
Explaining Vision and Language through Graphs of Events in Space and Time
Masala, Mihai, Cudlenco, Nicolae, Rebedea, Traian, Leordeanu, Marius
Artificial Intelligence is making great advances today and is starting to bridge the gap between vision and language. However, we are still far from understanding, explaining and controlling the visual content explicitly from a linguistic perspective, because we still lack a common explainable representation between the two domains. In this work we address this limitation and propose the Graph of Events in Space and Time (GEST), by which we can represent, create and explain both visual and linguistic stories. We provide a theoretical justification of our model and an experimental validation, which proves that GEST can bring solid complementary value alongside powerful deep learning models. In particular, GEST can help improve, at the content level, the generation of videos from text, by being easily incorporated into our novel video generation engine. Additionally, by using efficient graph matching techniques, the GEST graphs can also improve the comparisons between texts at the semantic level.
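As a toy illustration of comparing two stories through their event graphs, the sketch below scores similarity with networkx graph edit distance; this is a generic stand-in for the efficient graph matching techniques the abstract mentions, not the GEST implementation.

```python
# Generic illustration: compare two small event graphs with graph edit
# distance as a crude semantic similarity. networkx is an off-the-shelf
# stand-in; GEST's actual matching is more efficient and more structured.
import networkx as nx

def event_graph(events, relations):
    g = nx.DiGraph()
    for name, attrs in events.items():
        g.add_node(name, **attrs)
    for src, dst, rel in relations:
        g.add_edge(src, dst, relation=rel)
    return g

g1 = event_graph(
    {"e1": {"action": "open door"}, "e2": {"action": "enter room"}},
    [("e1", "e2", "before")],
)
g2 = event_graph(
    {"a": {"action": "open door"}, "b": {"action": "leave room"}},
    [("a", "b", "before")],
)

dist = nx.graph_edit_distance(
    g1, g2,
    node_match=lambda x, y: x["action"] == y["action"],
    edge_match=lambda x, y: x["relation"] == y["relation"],
)
print("edit distance:", dist)   # lower means the two stories are more similar
```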
All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment
Zhang, Chunhui, Sun, Xin, Liu, Li, Yang, Yiqian, Liu, Qiong, Zhou, Xi, Wang, Yanfeng
The current mainstream vision-language (VL) tracking framework consists of three parts: a visual feature extractor, a language feature extractor, and a fusion model. To pursue better performance, a natural modus operandi for VL tracking is to employ customized and heavier unimodal encoders and multi-modal fusion models. Albeit effective, existing VL trackers separate feature extraction and feature integration, resulting in extracted features that lack semantic guidance and have limited target-aware capability in complex scenarios, e.g. similar distractors and extreme illumination. In this work, inspired by the recent success of exploring foundation models with a unified architecture for both natural language and computer vision tasks, we propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone. Specifically, we mix raw vision and language signals to generate language-injected vision tokens, which we then concatenate before feeding into the unified backbone architecture. This approach achieves feature integration in a unified backbone, removing the need for carefully designed fusion modules and resulting in a more effective and efficient VL tracking framework. To further improve the learning efficiency, we introduce a multi-modal alignment module based on cross-modal and intra-modal contrastive objectives, providing more reasonable representations for the unified All-in-One transformer backbone. Extensive experiments on five benchmarks, i.e. OTB99-L, TNL2K, LaSOT, LaSOT_Ext and WebUAV-3M, demonstrate the superiority of the proposed tracker against existing state-of-the-art methods on VL tracking. Code will be made publicly available.
- Asia > China > Shanghai > Shanghai (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
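A minimal sketch of the unified-backbone idea described in the All-in-One abstract above: language information is injected into the vision tokens, both token streams are concatenated, and a single shared Transformer processes them. The module names, the mean-pooled injection and all dimensions are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch: mix language information into vision tokens, concatenate
# both token streams, and process them with one shared Transformer encoder.
# Dimensions and module names are illustrative, not the released code.
import torch
import torch.nn as nn

class UnifiedVLBackbone(nn.Module):
    def __init__(self, dim: int = 256, depth: int = 4, heads: int = 8):
        super().__init__()
        self.inject = nn.Linear(2 * dim, dim)   # fuses language into each vision token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, vision_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Summarise the language description and inject it into every vision token.
        text_summary = text_tokens.mean(dim=1, keepdim=True)              # (B, 1, D)
        text_summary = text_summary.expand(-1, vision_tokens.size(1), -1)
        injected = self.inject(torch.cat([vision_tokens, text_summary], dim=-1))
        # Joint feature extraction and interaction in a single backbone.
        tokens = torch.cat([injected, text_tokens], dim=1)                # (B, Nv+Nt, D)
        return self.backbone(tokens)

# Toy usage: 196 vision tokens from a search frame, 12 text tokens from the description.
model = UnifiedVLBackbone()
out = model(torch.randn(2, 196, 256), torch.randn(2, 12, 256))
print(out.shape)   # torch.Size([2, 208, 256])
```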